Welcome to the Intro to R Programming Workshop!
This is Part II.
Link to Slides: https://favstats.github.io/ds3_r_intro/
Packages are at the heart of R:
R packages are basically a collection of functions that you load into your working environment.
They contain code that other R users have prepared for the community.
It’s good to know your packages; they can really make your life easier.
I suggest keeping track of package developments on Twitter via #rstats.
You can install packages in R using the install.packages function:
install.packages("janitor")
However, installing is not enough. You also need to load the package via library.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Think of install.packages as buying a set of tools (for free!) and library as pulling out the tools each time you want to work with them.
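Since installing is only needed once but loading is needed in every session, it can be handy to check whether a package is already available before installing it. The following is a minimal sketch; the helper name load_or_install is hypothetical and not part of R or any package.

```r
# Hypothetical helper (not part of base R): install a package only when it is
# missing, then attach it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # "buying the tools": only needed once
  }
  library(pkg, character.only = TRUE)  # "pulling out the tools": needed every session
}

# "tools" ships with R, so nothing is downloaded here;
# in practice you would pass a package name such as "janitor".
load_or_install("tools")
```

Because library() normally expects an unquoted name, character.only = TRUE is needed when the package name is passed as a string.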
What is the tidyverse?
The tidyverse describes itself:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
We have already seen tidy data:
| Animal | Maximum Lifespan | Animal/Human Years Ratio |
|---|---|---|
| Domestic dog | 24.0 | 5.10 |
| Domestic cat | 30.0 | 4.08 |
| American alligator | 77.0 | 1.59 |
| Golden hamster | 3.9 | 31.41 |
| King penguin | 26.0 | 4.71 |
| Animal | Type | Value |
|---|---|---|
| Domestic dog | lifespan | 24.0 |
| Domestic dog | ratio | 5.10 |
| Domestic cat | lifespan | 30.0 |
| Domestic cat | ratio | 4.08 |
| American alligator | lifespan | 77.0 |
| American alligator | ratio | 1.59 |
| Golden hamster | lifespan | 3.9 |
| Golden hamster | ratio | 31.41 |
| King penguin | lifespan | 26.0 |
| King penguin | ratio | 4.71 |
The data above has multiple rows with the same observation (animal).
= not tidy
| Animal | Lifespan/Ratio |
|---|---|
| Domestic dog | 24.0 / 5.10 |
| Domestic cat | 30.0 / 4.08 |
| American alligator | 77.0 / 1.59 |
| Golden hamster | 3.9 / 31.41 |
| King penguin | 26.0 / 4.71 |
The data above has multiple variables per column.
= not tidy
Artist: Allison Horst
Tidy data has two decisive advantages:
Consistently prepared data is easier to read, process, load and save.
Many procedures (or the associated functions) in R require this type of data.
Artist: Allison Horst
First we install the packages of the tidyverse like this. In Google Colab we actually don’t need to install the tidyverse because it comes pre-installed!
install.packages("tidyverse")
Then we load them:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.0 v dplyr 1.0.1
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
We are going to work with a new dataset from here on out.
No worries, we will stay within the animal kingdom but we need a dataset that is a little more complex than what we have seen already.
Meet the Palmer Station penguins!
Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER.
Artist: Allison Horst
We could install the R package palmerpenguins and then access the data.
However, we are going to use a different method: directly load a .csv file (comma-separated values) into R from the internet.
We can use the readr package which provides many convenient functions to load data into R. Here we need read_csv:
penguins_raw <- read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins_raw.csv")
## Parsed with column specification:
## cols(
## studyName = col_character(),
## `Sample Number` = col_double(),
## Species = col_character(),
## Region = col_character(),
## Island = col_character(),
## Stage = col_character(),
## `Individual ID` = col_character(),
## `Clutch Completion` = col_character(),
## `Date Egg` = col_date(format = ""),
## `Culmen Length (mm)` = col_double(),
## `Culmen Depth (mm)` = col_double(),
## `Flipper Length (mm)` = col_double(),
## `Body Mass (g)` = col_double(),
## Sex = col_character(),
## `Delta 15 N (o/oo)` = col_double(),
## `Delta 13 C (o/oo)` = col_double(),
## Comments = col_character()
## )
penguins_raw
## # A tibble: 344 x 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie~ Anvers Torge~ Adul~ N1A1
## 2 PAL0708 2 Adelie~ Anvers Torge~ Adul~ N1A2
## 3 PAL0708 3 Adelie~ Anvers Torge~ Adul~ N2A1
## 4 PAL0708 4 Adelie~ Anvers Torge~ Adul~ N2A2
## 5 PAL0708 5 Adelie~ Anvers Torge~ Adul~ N3A1
## 6 PAL0708 6 Adelie~ Anvers Torge~ Adul~ N3A2
## 7 PAL0708 7 Adelie~ Anvers Torge~ Adul~ N4A1
## 8 PAL0708 8 Adelie~ Anvers Torge~ Adul~ N4A2
## 9 PAL0708 9 Adelie~ Anvers Torge~ Adul~ N5A1
## 10 PAL0708 10 Adelie~ Anvers Torge~ Adul~ N5A2
## # ... with 334 more rows, and 10 more variables: `Clutch Completion` <chr>,
## # `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>, `Delta 15 N
## # (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
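The "Parsed with column specification" message above lists the column types that read_csv guessed. If you would rather state the types yourself, you can pass them via the col_types argument. Here is a minimal sketch using a small temporary file; the file and its two rows (taken from the animal table earlier) are only for illustration.

```r
library(readr)

# Write a tiny illustrative CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Animal,Lifespan", "Domestic dog,24.0", "King penguin,26.0"), csv_path)

# Declare the column types explicitly instead of letting read_csv guess them.
animals <- read_csv(
  csv_path,
  col_types = cols(
    Animal = col_character(),
    Lifespan = col_double()
  )
)
animals
```

Stating the types up front also silences the column specification message.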
glimpse
We can also take a look at the data set using the glimpse function from dplyr.
glimpse(penguins_raw)
## Rows: 344
## Columns: 17
## $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "...
## $ `Sample Number` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
## $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adeli...
## $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anve...
## $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torge...
## $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "...
## $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2...
## $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No...
## $ `Date Egg` <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-...
## $ `Culmen Length (mm)` <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2,...
## $ `Culmen Depth (mm)` <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6,...
## $ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 1...
## $ `Body Mass (g)` <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675,...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MA...
## $ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9...
## $ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25....
## $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Ad...
janitor
janitor is not officially part of the tidyverse package collection, but in my view it is incredibly important to know.
It provides some convenient functions for basic cleaning of the data.
Just like any tidyverse-style package, its functions fulfill the following criterion:
The data is always the first argument.
This helps us to match by position.
install.packages("janitor")
library(janitor)
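To illustrate what "matching by position" means, here is a small sketch with janitor's clean_names() function. The toy data frame is made up for this example. Passing the data as the first unnamed argument should give the same result as naming the argument explicitly.

```r
library(janitor)

# Small made-up data frame with messy column names.
toy <- data.frame(`First Name` = "Ada", `Body Mass (g)` = 1, check.names = FALSE)

# Matching by position: the data is simply the first argument.
by_position <- clean_names(toy)

# Matching by name: in janitor the data argument is called `dat`.
by_name <- clean_names(dat = toy)

identical(by_position, by_name)
```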
clean_names()
One annoyance with the penguins_raw data is that it has spaces in the variable names. Urgh!
R requires backticks around variable names that contain spaces:
penguins_raw$`Delta 15 N (o/oo)`
## [1] NA 8.94956 8.36821 NA 8.76651 8.66496 9.18718 9.46060
## [9] NA 9.13362 8.63243 NA NA NA 8.55583 NA
## [17] 9.18528 8.67538 8.47827 9.11616 8.73762 8.66271 9.22286 8.43423
## [25] 9.63954 9.21292 8.93997 8.08138 8.38404 8.90027 9.69756 9.72764
## [33] 9.66523 8.79665 9.17847 9.15308 9.18985 8.04787 9.41131 NA
## [41] 9.68933 NA 9.50772 9.23720 9.36392 9.49106 NA NA
## [49] 9.51784 8.87988 8.46616 8.51362 8.19539 8.48095 8.41837 8.35396
## [57] 8.57199 8.56674 9.07878 9.10800 8.96472 8.74802 8.58063 8.62264
## [65] 8.62623 8.85562 8.56192 8.71078 8.47781 8.86853 7.88863 9.29808
## [73] 8.33524 8.18658 8.70642 8.29930 8.47257 8.35540 7.82381 9.05736
## [81] 7.69778 8.63259 7.88494 8.90002 8.32718 9.14863 8.57087 8.59147
## [89] 9.07826 8.36936 8.46531 8.77018 8.01485 8.49915 8.90723 8.48204
## [97] 8.10277 8.39459 9.04218 8.97025 8.84451 9.01079 9.21510 9.51929
## [105] 9.02642 8.85699 8.77322 9.59245 9.79532 9.31735 8.43951 8.65466
## [113] 9.02657 8.80186 8.80967 8.91434 9.18021 9.49645 8.96436 9.32277
## [121] 9.04296 9.11066 9.30722 9.59462 8.81668 9.22537 8.88098 8.52566
## [129] 9.19031 9.10702 8.98460 8.86495 8.98705 8.56708 8.71700 8.94365
## [137] 8.75984 8.95998 8.61651 9.25769 9.28810 9.23408 8.79787 9.05674
## [145] 9.06829 9.22033 9.11006 8.68744 8.94332 8.97533 8.93465 8.89640
## [153] 7.99300 8.14756 8.14705 8.25540 8.23450 7.99530 8.24515 8.22673
## [161] 8.13643 8.16310 8.19579 8.10417 7.77672 7.82080 7.79958 8.07137
## [169] 7.63884 8.27376 7.84057 7.96491 7.89620 7.63220 7.90436 7.90971
## [177] 7.68528 7.83733 7.96621 7.92358 7.68870 8.30515 NA 7.63452
## [185] 7.97408 7.76843 7.89744 8.03659 7.96935 8.13746 8.01979 8.14776
## [193] 8.14567 8.38324 8.37615 8.26548 8.46894 8.27141 8.47829 8.65803
## [201] 8.45167 8.55868 8.38289 8.39867 8.51951 8.50153 8.48789 8.63488
## [209] 8.58319 8.63604 8.48367 8.74647 8.65015 8.60092 8.62870 8.49662
## [217] 8.60447 8.47067 8.24253 8.49854 8.64931 8.63551 8.53018 8.35078
## [225] 8.24651 8.58487 8.47938 8.59640 8.39299 8.40327 8.24694 8.19749
## [233] 8.35802 8.28601 8.19101 8.20042 8.11238 8.27428 8.23468 8.15426
## [241] 8.12691 8.27595 8.29671 8.36701 8.15566 8.83352 8.20106 8.27102
## [249] 8.03624 7.88810 8.16582 8.20660 8.10231 8.31180 8.30817 8.65914
## [257] 8.25818 8.32359 8.12311 8.41017 8.42070 8.45738 8.24691 8.29226
## [265] 8.21634 8.78557 8.30231 8.08354 8.04111 8.33825 7.99184 NA
## [273] 8.41151 8.30166 8.24246 8.36390 9.03935 8.92069 9.29078 8.64701
## [281] 9.00642 8.88942 8.85664 8.63701 8.47173 8.79581 8.95063 8.68747
## [289] 8.72037 9.02330 9.12277 9.80590 10.02019 9.14382 9.32105 9.27158
## [297] 9.35138 9.42666 9.35416 9.28153 9.74144 9.36799 8.93990 9.63074
## [305] 9.37369 9.25177 9.08458 9.49283 9.36668 9.23196 9.75486 9.07825
## [313] 8.83502 9.43146 9.80589 10.02544 9.53262 9.61734 10.02372 9.36493
## [321] 9.43684 9.45827 9.46819 9.34089 9.68950 9.32169 9.46929 9.43782
## [329] 9.41500 9.93727 9.56534 9.77528 9.62357 9.88809 9.74492 9.46985
## [337] NA 9.65061 9.26715 9.70465 9.37608 9.46180 9.98044 9.39305
penguins_raw$`Flipper Length (mm)`
## [1] 181 186 195 NA 193 190 181 195 193 190 186 180 182 191 198 185 195 197
## [19] 184 194 174 180 189 185 180 187 183 187 172 180 178 178 188 184 195 196
## [37] 190 180 181 184 182 195 186 196 185 190 182 179 190 191 186 188 190 200
## [55] 187 191 186 193 181 194 185 195 185 192 184 192 195 188 190 198 190 190
## [73] 196 197 190 195 191 184 187 195 189 196 187 193 191 194 190 189 189 190
## [91] 202 205 185 186 187 208 190 196 178 192 192 203 183 190 193 184 199 190
## [109] 181 197 198 191 193 197 191 196 188 199 189 189 187 198 176 202 186 199
## [127] 191 195 191 210 190 197 193 199 187 190 191 200 185 193 193 187 188 190
## [145] 192 185 190 184 195 193 187 201 211 230 210 218 215 210 211 219 209 215
## [163] 214 216 214 213 210 217 210 221 209 222 218 215 213 215 215 215 216 215
## [181] 210 220 222 209 207 230 220 220 213 219 208 208 208 225 210 216 222 217
## [199] 210 225 213 215 210 220 210 225 217 220 208 220 208 224 208 221 214 231
## [217] 219 230 214 229 220 223 216 221 221 217 216 230 209 220 215 223 212 221
## [235] 212 224 212 228 218 218 212 230 218 228 212 224 214 226 216 222 203 225
## [253] 219 228 215 228 216 215 210 219 208 209 216 229 213 230 217 230 217 222
## [271] 214 NA 215 222 212 213 192 196 193 188 197 198 178 197 195 198 193 194
## [289] 185 201 190 201 197 181 190 195 181 191 187 193 195 197 200 200 191 205
## [307] 187 201 187 203 195 199 195 210 192 205 210 187 196 196 196 201 190 212
## [325] 187 198 199 201 193 203 187 197 191 203 202 194 206 189 195 207 202 193
## [343] 210 198
janitor can help with that, using a function called clean_names().
clean_names() turns all our messy column names into readable lower-case snake case:
penguins_clean <- clean_names(penguins_raw)
This is what the variables look like now:
glimpse(penguins_clean)
## Rows: 344
## Columns: 17
## $ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0...
## $ sample_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pe...
## $ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers",...
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen...
## $ stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adul...
## $ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "...
## $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "...
## $ date_egg <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16, ...
## $ culmen_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34....
## $ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18....
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ...
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347...
## $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE",...
## $ delta_15_n_o_oo <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.187...
## $ delta_13_c_o_oo <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.2980...
## $ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult ...
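If you only want to see how individual names would be cleaned, janitor also exports make_clean_names(), which works on a plain character vector. A short sketch, using two of the original column names from penguins_raw:

```r
library(janitor)

# make_clean_names() applies the same cleaning to a plain character vector.
messy <- c("Culmen Length (mm)", "Delta 15 N (o/oo)")
cleaned <- make_clean_names(messy)
cleaned
```

The result should match the corresponding names in the glimpse() output above.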
remove_constant()
Now we have another problem. Not all variables in the penguins_clean data set are that useful.
Some of them, like region, are the same across all observations. We don’t need those variables.
table(penguins_clean$region)
##
## Anvers
## 344
We can use the base R function table to quickly get some tabulations of our variable.
The function remove_constant() helps us get rid of these constant columns.
penguins_clean <- remove_constant(penguins_clean, quiet = FALSE)
## Removing 2 constant columns of 17 columns total (Removed: region, stage).
When we set quiet = FALSE we even get some info about what exactly was removed. Neat!
Another useful function in janitor is remove_empty(), which removes all rows or columns that consist only of missing values (i.e. NA).
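As a rough sketch of how remove_empty() behaves, consider a small made-up data frame in which one row and one column contain nothing but NA. The which argument controls whether rows, columns, or both are checked.

```r
library(janitor)

# Made-up data: row 2 and column `c` contain nothing but NA.
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", NA, "z"),
  c = c(NA, NA, NA)
)

# Drop rows and columns that consist entirely of missing values.
cleaned <- remove_empty(toy, which = c("rows", "cols"))
dim(cleaned)
```

Rows or columns that contain at least one non-missing value are kept.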
tidyr
Now we are already fairly advanced in our tidying.
But our dataset is still not entirely tidy.
Consider the species variable:
table(penguins_clean$species)
##
## Adelie Penguin (Pygoscelis adeliae)
## 152
## Chinstrap penguin (Pygoscelis antarctica)
## 68
## Gentoo penguin (Pygoscelis papua)
## 124
This variable violates the tidy rule that each cell should contain a single value.
species holds both the common name and the Latin name of the penguin.
separate()
We can use a tidyr function called separate() to turn this into two variables.
Two arguments are important for that:
sep: specifies by which character the value should be split
into: a vector which specifies the resulting new variable names
In our case we want to split at a space followed by an opening bracket (written as " \\(" because the bracket has to be escaped), and we will name our variables species and latin_name:
penguins_clean <- separate(penguins_clean, species, sep = " \\(", into = c("species", "latin_name"))
penguins_clean
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie~ Pygosceli~ Torge~ N1A1
## 2 PAL0708 2 Adelie~ Pygosceli~ Torge~ N1A2
## 3 PAL0708 3 Adelie~ Pygosceli~ Torge~ N2A1
## 4 PAL0708 4 Adelie~ Pygosceli~ Torge~ N2A2
## 5 PAL0708 5 Adelie~ Pygosceli~ Torge~ N3A1
## 6 PAL0708 6 Adelie~ Pygosceli~ Torge~ N3A2
## 7 PAL0708 7 Adelie~ Pygosceli~ Torge~ N4A1
## 8 PAL0708 8 Adelie~ Pygosceli~ Torge~ N4A2
## 9 PAL0708 9 Adelie~ Pygosceli~ Torge~ N5A1
## 10 PAL0708 10 Adelie~ Pygosceli~ Torge~ N5A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Now there is still a trailing ) at the end of latin_name. We can remove that using the stringr package and more specifically the str_remove() function.
penguins_clean$latin_name <- str_remove(penguins_clean$latin_name, "\\)")
penguins_clean
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie~ Pygosceli~ Torge~ N1A1
## 2 PAL0708 2 Adelie~ Pygosceli~ Torge~ N1A2
## 3 PAL0708 3 Adelie~ Pygosceli~ Torge~ N2A1
## 4 PAL0708 4 Adelie~ Pygosceli~ Torge~ N2A2
## 5 PAL0708 5 Adelie~ Pygosceli~ Torge~ N3A1
## 6 PAL0708 6 Adelie~ Pygosceli~ Torge~ N3A2
## 7 PAL0708 7 Adelie~ Pygosceli~ Torge~ N4A1
## 8 PAL0708 8 Adelie~ Pygosceli~ Torge~ N4A2
## 9 PAL0708 9 Adelie~ Pygosceli~ Torge~ N5A1
## 10 PAL0708 10 Adelie~ Pygosceli~ Torge~ N5A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
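One detail worth knowing: str_remove() removes only the first match of the pattern. That is fine here because each name contains a single closing bracket. The sketch below shows two variations; the string "a)b)" is a made-up example.

```r
library(stringr)

latin <- "Pygoscelis adeliae)"

# "\\)$" matches a closing bracket only at the very end of the string.
str_remove(latin, "\\)$")

# str_remove() drops only the first match; str_remove_all() drops every match.
str_remove("a)b)", "\\)")
str_remove_all("a)b)", "\\)")
```

Anchoring the pattern with $ is a slightly stricter choice, since it leaves brackets elsewhere in the string untouched.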
There is also a function called unite() which works in the opposite direction.
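To see both directions side by side, here is a small sketch based on two rows of the "multiple variables per column" table from earlier. The column name lifespan_ratio is chosen for this example.

```r
library(tidyr)
library(tibble)

# Two rows from the "multiple variables per column" table shown earlier.
untidy <- tibble(
  animal = c("Domestic dog", "Domestic cat"),
  lifespan_ratio = c("24.0 / 5.10", "30.0 / 4.08")
)

# separate(): one column becomes two; convert = TRUE turns the text into numbers.
split_animals <- separate(untidy, lifespan_ratio,
                          into = c("lifespan", "ratio"),
                          sep = " / ", convert = TRUE)

# unite(): the opposite direction, pasting two columns into one text column.
united_animals <- unite(split_animals, "lifespan_ratio", lifespan, ratio, sep = " / ")
```

Note that unite() always produces a character column, so the numbers become text again.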
Now our data is in tidy format!
We were in luck because the data pretty much already came in a format that was: 1 observation per row.
But what if that is not the case?
pivot_wider() and pivot_longer()
tidyr also comes equipped to deal with data in which one observation is spread across more than one row.
The function to use here is called pivot_wider.
Now our penguins_clean data is already tidy.
But we can just read in a dataset that isn’t:
untidy_animals <- read_csv("https://github.com/favstats/ds3_r_intro/blob/main/data/untidy_animals.csv?raw=true")
## Parsed with column specification:
## cols(
## Animal = col_character(),
## Type = col_character(),
## Value = col_double()
## )
untidy_animals
## # A tibble: 10 x 3
## Animal Type Value
## <chr> <chr> <dbl>
## 1 Domestic dog lifespan 24
## 2 Domestic dog ratio 5.1
## 3 Domestic cat lifespan 30
## 4 Domestic cat ratio 4.08
## 5 American alligator lifespan 77
## 6 American alligator ratio 1.59
## 7 Golden hamster lifespan 3.9
## 8 Golden hamster ratio 31.4
## 9 King penguin lifespan 26
## 10 King penguin ratio 4.71
You may recognize this data from the subsection Untidy data I.
Now let’s use pivot_wider to make every row an observation.
We need two main arguments for that:
names_from: tells the function where the new column names come from
values_from: tells the function where the values should come from
tidy_animals <- pivot_wider(untidy_animals, names_from = Type, values_from = Value)
tidy_animals
## # A tibble: 5 x 3
## Animal lifespan ratio
## <chr> <dbl> <dbl>
## 1 Domestic dog 24 5.1
## 2 Domestic cat 30 4.08
## 3 American alligator 77 1.59
## 4 Golden hamster 3.9 31.4
## 5 King penguin 26 4.71
pivot_longer can untidy our data again.
The argument cols tells the function which variables to turn into long format:
pivot_longer(tidy_animals, cols = c(lifespan, ratio))
## # A tibble: 10 x 3
## Animal name value
## <chr> <chr> <dbl>
## 1 Domestic dog lifespan 24
## 2 Domestic dog ratio 5.1
## 3 Domestic cat lifespan 30
## 4 Domestic cat ratio 4.08
## 5 American alligator lifespan 77
## 6 American alligator ratio 1.59
## 7 Golden hamster lifespan 3.9
## 8 Golden hamster ratio 31.4
## 9 King penguin lifespan 26
## 10 King penguin ratio 4.71
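In the output above the new columns are called name and value, which are the defaults. If you want to control these names, pivot_longer() has the arguments names_to and values_to. A minimal sketch that rebuilds two rows of the tidy animal data by hand so it runs on its own:

```r
library(tidyr)
library(tibble)

# Two rows of the tidy animal data from above.
tidy_animals <- tibble(
  Animal = c("Domestic dog", "Domestic cat"),
  lifespan = c(24, 30),
  ratio = c(5.10, 4.08)
)

# names_to / values_to set the names of the two new columns
# (the defaults are "name" and "value").
long_animals <- pivot_longer(tidy_animals,
                             cols = c(lifespan, ratio),
                             names_to = "Type",
                             values_to = "Value")
long_animals
```

With these two arguments the result has the same column names as the original untidy_animals data.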
dplyr
Artist: Allison Horst
select()
select() is part of the dplyr package and helps you select variables.
Remember: with tidyverse-style functions, data is always the first argument.
Here we only keep individual_id, sex and species.
select(penguins_clean, individual_id, sex, species)
## # A tibble: 344 x 3
## individual_id sex species
## <chr> <chr> <chr>
## 1 N1A1 MALE Adelie Penguin
## 2 N1A2 FEMALE Adelie Penguin
## 3 N2A1 FEMALE Adelie Penguin
## 4 N2A2 <NA> Adelie Penguin
## 5 N3A1 FEMALE Adelie Penguin
## 6 N3A2 MALE Adelie Penguin
## 7 N4A1 FEMALE Adelie Penguin
## 8 N4A2 MALE Adelie Penguin
## 9 N5A1 <NA> Adelie Penguin
## 10 N5A2 <NA> Adelie Penguin
## # ... with 334 more rows
But select() is more powerful than that.
We can also remove variables with a - (minus).
Here we remove individual_id, sex and species.
select(penguins_clean, -individual_id, -sex, -species)
## # A tibble: 344 x 13
## study_name sample_number latin_name island clutch_completi~ date_egg
## <chr> <dbl> <chr> <chr> <chr> <date>
## 1 PAL0708 1 Pygosceli~ Torge~ Yes 2007-11-11
## 2 PAL0708 2 Pygosceli~ Torge~ Yes 2007-11-11
## 3 PAL0708 3 Pygosceli~ Torge~ Yes 2007-11-16
## 4 PAL0708 4 Pygosceli~ Torge~ Yes 2007-11-16
## 5 PAL0708 5 Pygosceli~ Torge~ Yes 2007-11-16
## 6 PAL0708 6 Pygosceli~ Torge~ Yes 2007-11-16
## 7 PAL0708 7 Pygosceli~ Torge~ No 2007-11-15
## 8 PAL0708 8 Pygosceli~ Torge~ No 2007-11-15
## 9 PAL0708 9 Pygosceli~ Torge~ Yes 2007-11-09
## 10 PAL0708 10 Pygosceli~ Torge~ Yes 2007-11-09
## # ... with 334 more rows, and 7 more variables: culmen_length_mm <dbl>,
## # culmen_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Selection helpers match variables according to a given pattern:
starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
For example: let’s keep all variables that start with s:
select(penguins_clean, starts_with("s"))
## # A tibble: 344 x 4
## study_name sample_number species sex
## <chr> <dbl> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin MALE
## 2 PAL0708 2 Adelie Penguin FEMALE
## 3 PAL0708 3 Adelie Penguin FEMALE
## 4 PAL0708 4 Adelie Penguin <NA>
## 5 PAL0708 5 Adelie Penguin FEMALE
## 6 PAL0708 6 Adelie Penguin MALE
## 7 PAL0708 7 Adelie Penguin FEMALE
## 8 PAL0708 8 Adelie Penguin MALE
## 9 PAL0708 9 Adelie Penguin <NA>
## 10 PAL0708 10 Adelie Penguin <NA>
## # ... with 334 more rows
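The other selection helpers work the same way. The sketch below uses a one-row illustrative tibble that reuses some column names from penguins_clean, so that it runs without the full dataset.

```r
library(dplyr)
library(tibble)

# One illustrative row that reuses some column names from penguins_clean.
toy <- tibble(
  study_name = "PAL0708",
  culmen_length_mm = 39.1,
  flipper_length_mm = 181,
  body_mass_g = 3750,
  delta_15_n_o_oo = 8.95
)

select(toy, ends_with("_mm"))         # columns ending in "_mm"
select(toy, contains("length"))       # columns containing "length"
select(toy, matches("^delta_[0-9]"))  # regular expression: "delta_" followed by a digit
```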
Select the first 5 variables:
select(penguins_clean, 1:5)
## # A tibble: 344 x 5
## study_name sample_number species latin_name island
## <chr> <dbl> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin Pygoscelis adeliae Torgersen
## 2 PAL0708 2 Adelie Penguin Pygoscelis adeliae Torgersen
## 3 PAL0708 3 Adelie Penguin Pygoscelis adeliae Torgersen
## 4 PAL0708 4 Adelie Penguin Pygoscelis adeliae Torgersen
## 5 PAL0708 5 Adelie Penguin Pygoscelis adeliae Torgersen
## 6 PAL0708 6 Adelie Penguin Pygoscelis adeliae Torgersen
## 7 PAL0708 7 Adelie Penguin Pygoscelis adeliae Torgersen
## 8 PAL0708 8 Adelie Penguin Pygoscelis adeliae Torgersen
## 9 PAL0708 9 Adelie Penguin Pygoscelis adeliae Torgersen
## 10 PAL0708 10 Adelie Penguin Pygoscelis adeliae Torgersen
## # ... with 334 more rows
Select everything from individual_id to flipper_length_mm.
select(penguins_clean, individual_id:flipper_length_mm)
## # A tibble: 344 x 6
## individual_id clutch_completi~ date_egg culmen_length_mm culmen_depth_mm
## <chr> <chr> <date> <dbl> <dbl>
## 1 N1A1 Yes 2007-11-11 39.1 18.7
## 2 N1A2 Yes 2007-11-11 39.5 17.4
## 3 N2A1 Yes 2007-11-16 40.3 18
## 4 N2A2 Yes 2007-11-16 NA NA
## 5 N3A1 Yes 2007-11-16 36.7 19.3
## 6 N3A2 Yes 2007-11-16 39.3 20.6
## 7 N4A1 No 2007-11-15 38.9 17.8
## 8 N4A2 No 2007-11-15 39.2 19.6
## 9 N5A1 Yes 2007-11-09 34.1 18.1
## 10 N5A2 Yes 2007-11-09 42 20.2
## # ... with 334 more rows, and 1 more variable: flipper_length_mm <dbl>
filter()
filter() helps you filter rows.
Here we only keep penguins from Dream Island.
filter(penguins_clean, island == "Dream")
## # A tibble: 124 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 31 Adelie~ Pygosceli~ Dream N21A1
## 2 PAL0708 32 Adelie~ Pygosceli~ Dream N21A2
## 3 PAL0708 33 Adelie~ Pygosceli~ Dream N22A1
## 4 PAL0708 34 Adelie~ Pygosceli~ Dream N22A2
## 5 PAL0708 35 Adelie~ Pygosceli~ Dream N23A1
## 6 PAL0708 36 Adelie~ Pygosceli~ Dream N23A2
## 7 PAL0708 37 Adelie~ Pygosceli~ Dream N24A1
## 8 PAL0708 38 Adelie~ Pygosceli~ Dream N24A2
## 9 PAL0708 39 Adelie~ Pygosceli~ Dream N25A1
## 10 PAL0708 40 Adelie~ Pygosceli~ Dream N25A2
## # ... with 114 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
%in%
Here the %in% operator can come in handy again if we want to keep more than one island:
islands_to_keep <- c("Dream", "Biscoe")
filter(penguins_clean, island %in% islands_to_keep)
## # A tibble: 292 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 21 Adelie~ Pygosceli~ Biscoe N11A1
## 2 PAL0708 22 Adelie~ Pygosceli~ Biscoe N11A2
## 3 PAL0708 23 Adelie~ Pygosceli~ Biscoe N12A1
## 4 PAL0708 24 Adelie~ Pygosceli~ Biscoe N12A2
## 5 PAL0708 25 Adelie~ Pygosceli~ Biscoe N13A1
## 6 PAL0708 26 Adelie~ Pygosceli~ Biscoe N13A2
## 7 PAL0708 27 Adelie~ Pygosceli~ Biscoe N17A1
## 8 PAL0708 28 Adelie~ Pygosceli~ Biscoe N17A2
## 9 PAL0708 29 Adelie~ Pygosceli~ Biscoe N18A1
## 10 PAL0708 30 Adelie~ Pygosceli~ Biscoe N18A2
## # ... with 282 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
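filter() also accepts several conditions at once. The following sketch uses a small illustrative tibble; the weight thresholds are arbitrary values chosen for the example.

```r
library(dplyr)
library(tibble)

# Illustrative data with the same column names as penguins_clean.
toy <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen"),
  body_mass_g = c(3250, 4100, 5000, NA)
)

# Conditions separated by a comma must all be TRUE (logical AND).
heavy_dream <- filter(toy, island == "Dream", body_mass_g > 4000)

# `|` means OR.
dream_or_heavy <- filter(toy, island == "Dream" | body_mass_g > 4500)

# `!` negates a test: keep every island except those listed.
not_listed <- filter(toy, !island %in% c("Dream", "Biscoe"))
```

Rows for which a condition evaluates to NA are dropped, which is why the Torgersen row with a missing body mass does not appear in dream_or_heavy.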
mutate()
mutate() helps you create variables.
mutate will take a statement like this:
variable_name = some_calculation
and attach variable_name at the end of the dataset.
Let’s say we want to calculate penguin body mass in kilograms rather than grams.
We take the variable body_mass_g and divide it by 1000.
pg_new <- mutate(penguins_clean, bodymass_kg = body_mass_g/1000)
We temporarily assign the dataset to pg_new just to check whether it worked correctly:
select(pg_new, bodymass_kg, body_mass_g)
## # A tibble: 344 x 2
## bodymass_kg body_mass_g
## <dbl> <dbl>
## 1 3.75 3750
## 2 3.8 3800
## 3 3.25 3250
## 4 NA NA
## 5 3.45 3450
## 6 3.65 3650
## 7 3.62 3625
## 8 4.68 4675
## 9 3.48 3475
## 10 4.25 4250
## # ... with 334 more rows
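mutate() can create several variables in a single call, and later statements can use variables created earlier in the same call. A small sketch with three illustrative values; the variable is_heavy and its 3.77 kg cut-off are made up for the example.

```r
library(dplyr)
library(tibble)

toy <- tibble(body_mass_g = c(3750, 3800, NA))

# Several variables can be created in one call; later ones may use earlier ones.
toy_new <- mutate(toy,
                  bodymass_kg = body_mass_g / 1000,
                  is_heavy = bodymass_kg > 3.77)
toy_new
```

As in the output above, missing values stay missing in every derived variable.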
ifelse
ifelse() is a very useful function that allows you to easily recode variables based on logical tests.
Its basic usage looks like this:
\[\color{red}{\text{ifelse}}(\color{orange}{\text{logical test}},\color{blue}{\text{what should happen if TRUE}}, \color{green}{\text{what should happen if FALSE}})\]
Here is a very basic example:
ifelse(1 == 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is TRUE"
ifelse(1 != 1, "Pick me if test is TRUE", "Pick me if test is FALSE")
## [1] "Pick me if test is FALSE"
Let’s use ifelse in combination with mutate.
Let’s create the variable sex_short which has a shorter label for sex:
pg_new <- mutate(penguins_clean, sex_short = ifelse(sex == "MALE", "m", "f"))
We temporarily assign the dataset to pg_new just to check whether it worked correctly:
select(pg_new, sex, sex_short)
## # A tibble: 344 x 2
## sex sex_short
## <chr> <chr>
## 1 MALE m
## 2 FEMALE f
## 3 FEMALE f
## 4 <NA> <NA>
## 5 FEMALE f
## 6 MALE m
## 7 FEMALE f
## 8 MALE m
## 9 <NA> <NA>
## 10 <NA> <NA>
## # ... with 334 more rows
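Two things about this recoding are worth spelling out, shown here on a short vector. The lower-case "male" entry is a hypothetical value that does not occur in the penguin data.

```r
sex <- c("MALE", "FEMALE", NA, "male")

# A missing value in the test stays missing in the result,
# which is why some rows above show <NA>.
# Everything that is not exactly "MALE" (here: "male") falls into the "f" branch.
sex_short <- ifelse(sex == "MALE", "m", "f")
sex_short
```

So ifelse() is safe with NA, but with only one test every non-matching value ends up in the second category, whatever it is.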
case_when
case_when (from the dplyr package) is like ifelse but allows for much more complex combinations.
The basic setup for a case_when call looks like this:
case_when(
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(\color{orange}{\text{logical test}}\) ~ \(\color{blue}{\text{what should happen if TRUE}}\),
\(TRUE\) ~ \(\color{green}{\text{what should happen with everything else}}\),
)
The following code recodes a numeric vector (1 through 50) into three categories:
x <- c(1:50)
x
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
case_when(
x %in% 1:10 ~ "1 through 10",
x %in% 1:30 ~ "11 through 30",
TRUE ~ "above 30"
)
## [1] "1 through 10" "1 through 10" "1 through 10" "1 through 10"
## [5] "1 through 10" "1 through 10" "1 through 10" "1 through 10"
## [9] "1 through 10" "1 through 10" "11 through 30" "11 through 30"
## [13] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [17] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [21] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [25] "11 through 30" "11 through 30" "11 through 30" "11 through 30"
## [29] "11 through 30" "11 through 30" "above 30" "above 30"
## [33] "above 30" "above 30" "above 30" "above 30"
## [37] "above 30" "above 30" "above 30" "above 30"
## [41] "above 30" "above 30" "above 30" "above 30"
## [45] "above 30" "above 30" "above 30" "above 30"
## [49] "above 30" "above 30"
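The example above works even though the second condition, x %in% 1:30, also covers the values 1 through 10. case_when checks the conditions from top to bottom and uses the first one that is TRUE. A minimal sketch of why the order matters, with illustrative labels:

```r
library(dplyr)

x <- c(5, 15, 35)

# Conditions are checked from top to bottom; the first TRUE one wins.
in_order <- case_when(
  x <= 10 ~ "low",
  x <= 30 ~ "mid",
  TRUE    ~ "high"
)

# With the broad condition first, 5 is caught by it and never reaches "low".
wrong_order <- case_when(
  x <= 30 ~ "mid",
  x <= 10 ~ "low",
  TRUE    ~ "high"
)
```

Putting the most specific conditions first is therefore a sensible habit.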
Let’s use case_when in combination with mutate.
Let’s create the variable island_short, which has a shorter label for island:
test <- mutate(penguins_clean,
island_short = case_when(
island == "Torgersen" ~ "T",
island == "Biscoe" ~ "B",
island == "Dream" ~ "D"
))
select(test, island, island_short)
## # A tibble: 344 x 2
## island island_short
## <chr> <chr>
## 1 Torgersen T
## 2 Torgersen T
## 3 Torgersen T
## 4 Torgersen T
## 5 Torgersen T
## 6 Torgersen T
## 7 Torgersen T
## 8 Torgersen T
## 9 Torgersen T
## 10 Torgersen T
## # ... with 334 more rows
With case_when you can also mix different variables making this a very powerful tool!
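A possible sketch of such a mixed recoding, using a small illustrative tibble. The labels and the 4000 g threshold are arbitrary choices for this example.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  island = c("Dream", "Dream", "Biscoe"),
  body_mass_g = c(3250, 4100, 5000)
)

# Each condition may refer to a different variable, or to several at once.
toy_labelled <- mutate(toy,
  label = case_when(
    island == "Dream" & body_mass_g > 4000 ~ "heavy Dream penguin",
    island == "Dream"                      ~ "lighter Dream penguin",
    TRUE                                   ~ "other island"
  )
)
toy_labelled$label
```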
rename()
rename() just changes the variable name but leaves everything else intact. The syntax is new_name = old_name:
rename(penguins_clean, sample = sample_number)
## # A tibble: 344 x 16
## study_name sample species latin_name island individual_id clutch_completi~
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie~ Pygosceli~ Torge~ N1A1 Yes
## 2 PAL0708 2 Adelie~ Pygosceli~ Torge~ N1A2 Yes
## 3 PAL0708 3 Adelie~ Pygosceli~ Torge~ N2A1 Yes
## 4 PAL0708 4 Adelie~ Pygosceli~ Torge~ N2A2 Yes
## 5 PAL0708 5 Adelie~ Pygosceli~ Torge~ N3A1 Yes
## 6 PAL0708 6 Adelie~ Pygosceli~ Torge~ N3A2 Yes
## 7 PAL0708 7 Adelie~ Pygosceli~ Torge~ N4A1 No
## 8 PAL0708 8 Adelie~ Pygosceli~ Torge~ N4A2 No
## 9 PAL0708 9 Adelie~ Pygosceli~ Torge~ N5A1 Yes
## 10 PAL0708 10 Adelie~ Pygosceli~ Torge~ N5A2 Yes
## # ... with 334 more rows, and 9 more variables: date_egg <date>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>
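Several variables can be renamed in one call. A short sketch on a two-row illustrative tibble; the new name id is chosen for this example.

```r
library(dplyr)
library(tibble)

toy <- tibble(sample_number = 1:2, individual_id = c("N1A1", "N1A2"))

# new_name = old_name; several columns can be renamed in one call.
renamed <- rename(toy, sample = sample_number, id = individual_id)
names(renamed)
```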
arrange()
You can order your data to show the highest or lowest value first.
Let’s order by flipper_length_mm.
Lowest first:
arrange(penguins_clean, flipper_length_mm)
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 29 Adelie~ Pygosceli~ Biscoe N18A1
## 2 PAL0708 21 Adelie~ Pygosceli~ Biscoe N11A1
## 3 PAL0910 123 Adelie~ Pygosceli~ Torge~ N67A1
## 4 PAL0708 31 Adelie~ Pygosceli~ Dream N21A1
## 5 PAL0708 32 Adelie~ Pygosceli~ Dream N21A2
## 6 PAL0809 99 Adelie~ Pygosceli~ Dream N50A1
## 7 PAL0708 7 Chinst~ Pygosceli~ Dream N66A1
## 8 PAL0708 48 Adelie~ Pygosceli~ Dream N29A2
## 9 PAL0708 12 Adelie~ Pygosceli~ Torge~ N6A2
## 10 PAL0708 22 Adelie~ Pygosceli~ Biscoe N11A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
Highest first using desc() (for descending):
arrange(penguins_clean, desc(flipper_length_mm))
## # A tibble: 344 x 16
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0809 64 Gentoo~ Pygosceli~ Biscoe N19A2
## 2 PAL0708 2 Gentoo~ Pygosceli~ Biscoe N31A2
## 3 PAL0708 34 Gentoo~ Pygosceli~ Biscoe N56A2
## 4 PAL0809 66 Gentoo~ Pygosceli~ Biscoe N20A2
## 5 PAL0809 76 Gentoo~ Pygosceli~ Biscoe N56A2
## 6 PAL0910 90 Gentoo~ Pygosceli~ Biscoe N14A2
## 7 PAL0910 114 Gentoo~ Pygosceli~ Biscoe N34A2
## 8 PAL0910 116 Gentoo~ Pygosceli~ Biscoe N35A2
## 9 PAL0809 68 Gentoo~ Pygosceli~ Biscoe N51A2
## 10 PAL0910 112 Gentoo~ Pygosceli~ Biscoe N32A2
## # ... with 334 more rows, and 10 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>
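Here is a small sketch of how arrange() behaves with more than one sorting variable. The toy tibble and its column names (id, flipper, mass) are made up for illustration and are not part of penguins_clean.

```r
library(dplyr)

# Made-up toy data, only for illustration
toy <- tibble(
  id      = c("a", "b", "c", "d"),
  flipper = c(190, 210, 190, 230),
  mass    = c(3500, 4800, 3900, 5600)
)

# Lowest flipper first; rows with the same flipper value
# are ordered by mass, heaviest first
sorted <- arrange(toy, flipper, desc(mass))

sorted$id
# ids come out in the order: c, a, b, d
```

Each extra variable you pass to arrange() is only used to break ties in the variables before it.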
group_by() and summarize()
Use these when you want to aggregate your data (by groups).
Sometimes we want to calculate group statistics.
In other languages this is often a pain.
With dplyr this is fairly easy and readable.
Let’s calculate the average culmen_length_mm for each sex.
First we group penguins_clean by sex.
grouped_by_sex <- group_by(penguins_clean, sex)
summarize works in a similar way to mutate:
variable_name = some_calculation
summarise(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## sex avg_culmen_length_mm
## <chr> <dbl>
## 1 FEMALE 42.1
## 2 MALE 45.9
## 3 <NA> 41.3
We could also keep the data structure by using mutate on a grouped dataset:
mutate(grouped_by_sex, avg_culmen_length_mm = mean(culmen_length_mm, na.rm = T))
## # A tibble: 344 x 17
## # Groups: sex [3]
## study_name sample_number species latin_name island individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie~ Pygosceli~ Torge~ N1A1
## 2 PAL0708 2 Adelie~ Pygosceli~ Torge~ N1A2
## 3 PAL0708 3 Adelie~ Pygosceli~ Torge~ N2A1
## 4 PAL0708 4 Adelie~ Pygosceli~ Torge~ N2A2
## 5 PAL0708 5 Adelie~ Pygosceli~ Torge~ N3A1
## 6 PAL0708 6 Adelie~ Pygosceli~ Torge~ N3A2
## 7 PAL0708 7 Adelie~ Pygosceli~ Torge~ N4A1
## 8 PAL0708 8 Adelie~ Pygosceli~ Torge~ N4A2
## 9 PAL0708 9 Adelie~ Pygosceli~ Torge~ N5A1
## 10 PAL0708 10 Adelie~ Pygosceli~ Torge~ N5A2
## # ... with 334 more rows, and 11 more variables: clutch_completion <chr>,
## # date_egg <date>, culmen_length_mm <dbl>, culmen_depth_mm <dbl>,
## # flipper_length_mm <dbl>, body_mass_g <dbl>, sex <chr>,
## # delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>, comments <chr>,
## # avg_culmen_length_mm <dbl>
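To see the difference between summarise and mutate on grouped data at a glance, here is a minimal sketch on a tiny made-up tibble (the column name culmen and the values are assumptions for illustration only).

```r
library(dplyr)

# Made-up toy data with one missing sex and one missing measurement
toy <- tibble(
  sex    = c("FEMALE", "FEMALE", "MALE", "MALE", NA),
  culmen = c(40, 44, 46, NA, 41)
)

grouped <- group_by(toy, sex)

# summarise() collapses the data: one row per group
per_group <- summarise(grouped, avg_culmen = mean(culmen, na.rm = TRUE))

# mutate() keeps every row and repeats the group average on each of them
per_row <- mutate(grouped, avg_culmen = mean(culmen, na.rm = TRUE))

nrow(per_group)  # 3 rows: FEMALE, MALE and NA
nrow(per_row)    # 5 rows: same as the input
```

Note that the missing values in sex form their own group, just like the <NA> row in the penguin output above.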
count()
Now this is a function that I use all the time.
This function helps you count how often each value occurs within one or more variables.
Simply specify which variable you want to count.
Let’s count how often each species occurs.
count(penguins_clean, species, sort = T)
## # A tibble: 3 x 2
## species n
## <chr> <int>
## 1 Adelie Penguin 152
## 2 Gentoo penguin 124
## 3 Chinstrap penguin 68
The sort = T argument tells the function to sort by frequency, with the most frequent value first.
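One way to think about count() is as a shortcut for grouping, summarising and sorting. The sketch below compares both routes on a made-up toy tibble; it is meant as an illustration, not as a description of how count() is implemented internally.

```r
library(dplyr)

# Made-up toy data, only for illustration
toy <- tibble(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo")
)

# The short way
counted <- count(toy, species, sort = TRUE)

# The long way: group, count the rows per group with n(), then sort
by_hand <- arrange(
  summarise(group_by(toy, species), n = n()),
  desc(n)
)

counted$species  # Adelie, Gentoo, Chinstrap
counted$n        # 3, 2, 1
```

Both routes give the same species order and the same counts here.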
%>% operator
The point of the pipe is to help you write code in a way that is easier to read and understand.
Let’s consider an example with some data manipulation we have done so far:
## first I select variables
pg <- select(penguins_clean, individual_id, island, body_mass_g)
## then I filter to only Dream island
pg <- filter(pg, island == "Dream")
## then I convert body_mass_g to kg
pg <- mutate(pg, bodymass_kg = body_mass_g/1000)
## rename individual id to simply id
pg <- rename(pg, id = individual_id)
Now this works but the problem is: we have to write a lot of code that repeats itself!
pg
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
Another alternative is to nest all the functions:
rename(mutate(filter(select(penguins_clean, individual_id, island, body_mass_g), island == "Dream"), bodymass_kg = body_mass_g/1000), id = individual_id)
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
But that’s extremely tough to read and understand!
The piping style:
Read the code from top to bottom and from left to right, and read the %>% as “and then”.
Data first, data once
penguins_clean %>%
select(individual_id, island, body_mass_g) %>%
filter(island == "Dream") %>%
mutate(bodymass_kg = body_mass_g/1000) %>%
rename(id = individual_id)
## # A tibble: 124 x 4
## id island body_mass_g bodymass_kg
## <chr> <chr> <dbl> <dbl>
## 1 N21A1 Dream 3250 3.25
## 2 N21A2 Dream 3900 3.9
## 3 N22A1 Dream 3300 3.3
## 4 N22A2 Dream 3900 3.9
## 5 N23A1 Dream 3325 3.32
## 6 N23A2 Dream 4150 4.15
## 7 N24A1 Dream 3950 3.95
## 8 N24A2 Dream 3550 3.55
## 9 N25A1 Dream 3300 3.3
## 10 N25A2 Dream 4650 4.65
## # ... with 114 more rows
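If you wonder what the pipe actually does: it takes whatever is on its left and hands it to the function on its right as the first argument. A minimal sketch on a made-up toy tibble (the values are assumptions for illustration):

```r
library(dplyr)

# Made-up toy data, only for illustration
toy <- tibble(
  island      = c("Dream", "Biscoe", "Dream"),
  body_mass_g = c(3250, 4500, 3900)
)

# With the pipe: take toy, and then filter, and then mutate
piped <- toy %>%
  filter(island == "Dream") %>%
  mutate(bodymass_kg = body_mass_g / 1000)

# Without the pipe: the same two steps written as nested calls
nested <- mutate(filter(toy, island == "Dream"), bodymass_kg = body_mass_g / 1000)

piped$bodymass_kg   # 3.25 3.90
nested$bodymass_kg  # 3.25 3.90
```

This is why the "data first" rule matters: the pipe only works smoothly with functions that take the data as their first argument.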
group_by() again
Grouping also becomes easier using pipes.
Let’s try again to calculate the average culmen_length_mm for each sex but this time with pipes.
penguins_clean %>%
group_by(sex) %>%
summarise(avg_culmen_length = mean(culmen_length_mm , na.rm = T))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## sex avg_culmen_length
## <chr> <dbl>
## 1 FEMALE 42.1
## 2 MALE 45.9
## 3 <NA> 41.3
Since R version 4.1.0, base R also provides its own pipe.
It looks like this:
|>
While it shares many similarities with the %>%, there are also some differences.
Going over them is beyond the scope of this workshop, so for the sake of simplicity we will stick with the magrittr pipe.
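For the curious, here is a tiny sketch of the base R pipe next to the magrittr pipe. It assumes R 4.1.0 or newer and uses a made-up numeric vector; it shows only the simplest case and one small difference, not a full comparison.

```r
library(dplyr)  # makes the magrittr pipe %>% available

values <- c(4, 9, 16)

# Base R pipe: the function on the right needs its parentheses
native <- values |> sqrt() |> sum()

# magrittr pipe: same result (here the parentheses would be optional)
magrittr_style <- values %>% sqrt() %>% sum()

# Plain nested call for comparison
nested <- sum(sqrt(values))

native  # 9
```

One difference you may run into: values %>% sqrt works, while values |> sqrt (without parentheses) is an error.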
The following is a list of exercises that you can complete on your own.
We are going to use the palmerpenguins dataset for the tasks ahead!
For reference, here is a list of some useful functions.
If you have trouble with any of these functions, try reading the documentation with ?function_name
Remember: all these functions take the data first.
filter()
mutate()
rename()
select()
summarise(); summarize()
group_by(); ungroup()
arrange()
count(); tally()
distinct()
pull()
ifelse()
case_when() (when ifelse is not enough)
separate()
pivot_wider()
pivot_longer()
Load the tidyverse and janitor packages.
If janitor is not installed yet (R will say that janitor was not found), install it.
Read in the already cleaned palmerpenguins dataset using read_csv.
Assign the resulting data to penguins.
Then take a look at it using glimpse.
What kind of variables can you recognize?
Only keep the variables: species, island and sex.
Only keep variables 2 to 4.
Remove the column year.
Only include columns that contain “mm” in the variable name.
Rename island to location.
Filter the data so that species only includes Chinstrap.
Filter the data so that species only includes Chinstrap or Gentoo.
Filter the data so it includes only penguins that are male and of the species Adelie.
Create three new variables that convert bill_length_mm, bill_depth_mm and flipper_length_mm from millimeters to centimeters.
Tip: divide the length value by 10.
Create a new variable called bill_depth_cat which has two values:
Create a new variable called species_short.
Adelie should become A
Chinstrap should become C
Gentoo should become G
Calculate the average body_mass_g per island.
If you haven’t done so already, try using the %>% operator to do this.
Use the pipe operator (%>%) to do all the operations below.
Filter the penguins data so that it only includes Chinstrap or Adelie.
Rename sex to observed_sex.
Only keep species, observed_sex, bill_length_mm and bill_depth_mm.
Calculate the ratio of bill_length_mm and bill_depth_mm.
Try to create the pipe step by step and execute code as you go to see if it works.
Once you are done, assign the data to new_penguins.
Calculate the average ratio by species and sex, again using pipes.
Count the number of penguins by island and species.
Below is a dataset that needs some cleaning.
Use the skills that you have learned so far to turn the data into a tidy dataset.
animal_friends <- tibble(
Names = c("Francis", "Catniss", "Theodor", "Eugenia"),
TheAnimals = c("Dog", "Cat", "Hamster", "Rabbit"),
Sex = c("m", "f", "m", "f"),
a_opterr = c("me", "me", "me", "me"),
`Age/Adopted/Condition` = c("8/2020/Very Good", "13/2019/Wild", "1/2021/Fair", "2/2020/Good")
)
Start here:
If you are done, turn the final data into long format.